fix(keyshare): align halt failure propagation and dkg timeout budget#1530
fix(keyshare): align halt failure propagation and dkg timeout budget#1530hmzakhalid wants to merge 1 commit intomainfrom
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughWalkthroughThis PR introduces derived, budget-based DKG timeout handling by replacing fixed per-stage timeouts with a DKG-window scheme (default 7200s) and per-phase cutoff percentages. A new timeout policy module computes timeouts for encryption key, threshold share, and decryption key collection phases, passed as explicit parameters to each collector during initialization. ChangesDKG Timeout Policy Integration
Sequence DiagramsequenceDiagram
participant TK as ThresholdKeyshare
participant TP as timeout_policy
participant EKC as EncryptionKeyCollector
participant TSC as ThresholdShareCollector
TK->>TK: Initialize state with dkg_started_at_unix_secs
TK->>TP: resolve_timeout(EncryptionKeyCollection, dkg_start_time)
TP-->>TK: DerivedTimeout { duration, description }
TK->>EKC: setup(..., timeout_duration)
EKC->>EKC: Store timeout, schedule expiration
TK->>TP: resolve_timeout(ThresholdShareCollection, dkg_start_time)
TP-->>TK: DerivedTimeout { duration, description }
TK->>TSC: setup(..., timeout_duration)
TSC->>TSC: Store timeout, schedule expiration
Note over EKC,TSC: Collection phases execute<br/>with derived timeouts
EKC->>EKC: timeout fires
EKC->>TK: EncryptionKeyCollectionFailed
TK->>TK: Emit E3Failed, republish telemetry, stop
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly Related PRs
Suggested Reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
crates/keyshare/src/threshold_keyshare.rs (1)
266-299:⚠️ Potential issue | 🟠 Major | ⚡ Quick winStart the DKG timeout clock on
CiphernodeSelected, not inThresholdKeyshareState::new.
resolve_timeout(...)subtracts elapsed time fromdkg_started_at_unix_secs, but Line 298 seeds that value when the actor/state is constructed. If the actor exists for any meaningful time before DKG actually begins, the collectors inherit a reduced or even exhausted budget and can timeout immediately on startup. Initialize this field toNonehere and stamp it whenhandle_ciphernode_selectedbegins the DKG flow, before the collectors are created.💡 Suggested direction
pub struct ThresholdKeyshareState { ... #[serde(default)] pub dkg_started_at_unix_secs: Option<u64>, ... } impl ThresholdKeyshareState { pub fn new( e3_id: E3id, party_id: PartyId, state: KeyshareState, threshold_m: u64, threshold_n: u64, params: ArcBytes, address: String, proof_aggregation_enabled: bool, ) -> Self { Self { e3_id, address, party_id, state, threshold_m, threshold_n, params, aggregated_pk: None, expelled_parties: HashSet::new(), honest_parties: None, - dkg_started_at_unix_secs: Some(now_unix_secs()), + dkg_started_at_unix_secs: None, proof_aggregation_enabled, } } }Then, in
handle_ciphernode_selected, setdkg_started_at_unix_secs = Some(now_unix_secs())before callingensure_collector(...)/ensure_encryption_key_collector(...).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/keyshare/src/threshold_keyshare.rs` around lines 266 - 299, The dkg_started_at_unix_secs is being initialized in ThresholdKeyshareState::new which causes resolve_timeout to undercount if the actor lives before DKG starts; change the initialization in ThresholdKeyshareState::new to set dkg_started_at_unix_secs to None, and then in handle_ciphernode_selected set self.dkg_started_at_unix_secs = Some(now_unix_secs()) immediately before calling ensure_collector(...) and ensure_encryption_key_collector(...), leaving resolve_timeout to compute elapsed correctly.
🧹 Nitpick comments (1)
crates/keyshare/src/timeout_policy.rs (1)
129-134: ⚡ Quick winWarn on invalid timeout env values instead of silently ignoring them.
Right now malformed or zero-valued overrides fall back to the default path with no signal. For a rollout-sensitive timeout policy, that makes misconfiguration very hard to spot in production. A small warning here would make bad overrides immediately visible.
💡 Suggested change
+use tracing::warn; + fn parse_env_secs(name: &str) -> Option<u64> { - std::env::var(name) - .ok() - .and_then(|value| value.parse::<u64>().ok()) - .filter(|secs| *secs > 0) + match std::env::var(name) { + Ok(value) => match value.parse::<u64>() { + Ok(secs) if secs > 0 => Some(secs), + _ => { + warn!(env = name, value = %value, "Ignoring invalid timeout override"); + None + } + }, + Err(_) => None, + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/keyshare/src/timeout_policy.rs` around lines 129 - 134, The parse_env_secs function currently drops malformed or zero values silently; update parse_env_secs to detect when the environment variable is present but fails to parse or is <= 0 and emit a warning (using the project's logging facility, e.g., tracing::warn! or log::warn!) that includes the variable name and the invalid value, then continue returning None for the default behavior; locate the function parse_env_secs and add the warning branch where .ok().and_then(...).filter(...) would otherwise swallow the bad input.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@crates/keyshare/src/threshold_keyshare.rs`:
- Around line 266-299: The dkg_started_at_unix_secs is being initialized in
ThresholdKeyshareState::new which causes resolve_timeout to undercount if the
actor lives before DKG starts; change the initialization in
ThresholdKeyshareState::new to set dkg_started_at_unix_secs to None, and then in
handle_ciphernode_selected set self.dkg_started_at_unix_secs =
Some(now_unix_secs()) immediately before calling ensure_collector(...) and
ensure_encryption_key_collector(...), leaving resolve_timeout to compute elapsed
correctly.
---
Nitpick comments:
In `@crates/keyshare/src/timeout_policy.rs`:
- Around line 129-134: The parse_env_secs function currently drops malformed or
zero values silently; update parse_env_secs to detect when the environment
variable is present but fails to parse or is <= 0 and emit a warning (using the
project's logging facility, e.g., tracing::warn! or log::warn!) that includes
the variable name and the invalid value, then continue returning None for the
default behavior; locate the function parse_env_secs and add the warning branch
where .ok().and_then(...).filter(...) would otherwise swallow the bad input.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d59d5bea-4e0b-47f1-9124-05a09e817373
📒 Files selected for processing (7)
agent/flow-trace/04_DKG_AND_COMPUTATION.mdcrates/keyshare/src/decryption_key_shared_collector.rscrates/keyshare/src/encryption_key_collector.rscrates/keyshare/src/lib.rscrates/keyshare/src/threshold_keyshare.rscrates/keyshare/src/threshold_share_collector.rscrates/keyshare/src/timeout_policy.rs
closes #1531
Summary by CodeRabbit
New Features
Documentation
Bug Fixes